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Abstract 

Complicated generative models often result 
in a situation where computing the likelihood 
of observed data is intractable, while simulat¬ 
ing from the conditional density given a pa¬ 
rameter value is relatively easy. Approximate 
Bayesian Computation (ABC) is a paradigm 
that enables simulation-based posterior infer¬ 
ence in such cases by measuring the simi¬ 
larity between simulated and observed data 
in terms of a chosen set of summary statis¬ 
tics. However, there is no general rule to con¬ 
struct sufficient summary statistics for com¬ 
plex models. Insufficient summary statistics 
will “leak” information, which leads to ABC 
algorithms yielding samples from an incorrect 
(partial) posterior. In this paper, we pro¬ 
pose a fully nonparametric ABC paradigm 
which circumvents the need for manually se¬ 
lecting summary statistics. Our approach, 
K2-ABC, uses maximum mean discrepancy 
(MMD) to construct a dissimilarity measure 
between the observed and simulated data. 
The embedding of an empirical distribution 
of the data into a reproducing kernel Hilbert 
space plays a role of the summary statistic 
and is sufficient whenever the corresponding 
kernels are characteristic. Experiments on a 
simulated scenario and a real-world biologi¬ 
cal problem illustrate the effectiveness of the 
proposed algorithm. 
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1 Introduction 

ABC is an approximate Bayesian inference framework 
for models with intractable likelihoods. Originated in 
population genetics [T], ABC has been widely used in 
a broad range of scientific applications including ecol¬ 
ogy, cosmology, and bioinformatics [aiail]. The goal 
of ABC is to obtain an approximate posterior distribu¬ 
tion over model parameters, which usually correspond 
to interpretable inherent mechanisms in natural phe¬ 
nomena. However, in many complex models of inter¬ 
est, exact posterior inference is intractable since the 
likelihood function, the probability of observations y* 
for given model parameters 0, is either expensive or im¬ 
possible to evaluat^ This leads to a situation where 
the posterior density cannot be evaluated even up to 
a normalisation constant. 

ABC resorts to an approximation of the likelihood 
function using simulated observations that are similar 
to the actual observations. Most ABC algorithms eval¬ 
uate the similarity between the simulated and actual 
observations in terms of a pre-chosen set of summary 
statistics. Since the full dataset is represented in a 
lower-dimensional space of summary statistics, unless 
the selected summary statistic s* is sufficient, ABC 
results in inference on the partial posterior p{9\s*), 
rather than the desired full posterior p{d\y*). There¬ 
fore, a poor choice of a summary statistic can lead to 
an additional bias that can be difficult to quantify and 
the selection of summary statistics is a crucial step 
in ABC OH]. A number of methods have been pro¬ 
posed for constructing informative summary statistics 
by appropriate transformations of a set of candidate 
summary statistics: a minimum-entropy approach [7], 
regression from a set of candidate summary statistics 
to parameters [5], or a partial least-squares transfor¬ 
mation with boosting laiin]. A review of these meth- 

'^Note that this is different from the intractability of the 
marginal likelihood, which requires integrating out 9 from 
the likelihood function. While marginal likelihood is often 
computationally intractable, this issue is readily resolved 
via a suite of MCMC algorithms. 
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ods is given in [^. However, the obtained summary 
statistics are optimal only with respect to a given loss 
function (e.g., [8] focuses on the quadratic loss) and 
may not suffice for full posterior inference. In addi¬ 
tion, they still heavily depend on the initial choice of 
candidate summary statistics. 

Another line of research, inspired by indirect infer¬ 
ence framework, constructs the summary statistics us¬ 
ing auxiliary models [13 H]. Here, the estimated pa¬ 
rameters of an auxiliary model become the summary 
statistics for ABC. A thorough review of these method 
is given in m- The biggest advantage of this frame¬ 
work is that the dimensionality of the summary statis¬ 
tic can be determined by controlling the complexity 
of the auxiliary model using standard model selec¬ 
tion techniques such as Bayesian information criterion 
(BIC) and Akaike information criterion (AIC). How¬ 
ever, even if the complexity can be set in a principled 
way, one still needs to select the right parametric form 
of the auxiliary model. Furthermore, unless one uses 
an exponential family model, the dimensionality of the 
sufficient statistic will increase with the sample size 
according to the classical Pitman-Koopman-Darmois 
theorem (cf. e.g. |13jl. Thus, for many complex mod¬ 
els, the minimal sufficient statistic is effectively the full 
dataset. 

In this light, we introduce a method that circum¬ 
vents an explicit selection of summary statistics. Our 
method proceeds by applying a similarity measure to 
data themselves, via embedding [Ulli] of the empir¬ 
ical data distributions into an infinite-dimensional re¬ 
producing kernel Hilbert space (RKHS), correspond¬ 
ing to a positive definite kernel function defined on the 
data space. For suitably chosen kernels called char¬ 
acteristic m, these embeddings capture all possible 
differences between distributions, e.g. all the high- 
order moment information that may be missed with 
a set of simple summary statistics. Thus, no infor¬ 
mation loss is incurred when going from the posterior 
given data p{9\y*) to that given the embedding fi{y*) 
of data, p{9\fi{y*)). A flexible representation of prob¬ 
ability measures given by kernel embeddings has al¬ 
ready been applied to nonparametric hypothesis test¬ 
ing m, inference in graphical models |18) and to pro¬ 
posal construction in adaptive MCMC [T9j. The key 
quantity arising from this framework is an easily com¬ 
putable notion of a distance between probability mea¬ 
sures, termed Maximum Mean Discrepancy (MMD). 
When the kernel used is characteristic, a property sat¬ 
isfied by most commonly used kernels including Gaus¬ 
sian, Laplacian and inverse multiquadrics, embeddings 
are injective, meaning that the MMD gives a metric 
on probability measures. Here, we adopt MMD in the 
context of ABC as a nonparametric distance between 


empirical distributions of simulated and observed data. 
As such, there is no need to select a summary statistic 
first and the kernel embedding itself plays a role of a 
summary statistic. In addition to the positive definite 
kernel used for obtaining the estimates of MMD, we 
apply an additional Gaussian smoothing kernel which 
operates on the corresponding RKHS, i.e., on embed¬ 
dings themselves, to obtain a measure of similarity be¬ 
tween simulated and observed data. For this reason, 
we refer to our method as double-kernel ABC, or K2- 
ABC. Our experimental results in section [^ demon¬ 
strate that this approach results in an effective and 
robust ABC method which requires no hand-crafted 
summary statistics. 

The rest of the paper is organised as follows. In sec¬ 
tion]^ we overview classical approaches (rejection and 
soft ABC) as well as several relevant recent techniques 
(synthetic likelihood ABC, semi-automatic ABC, in¬ 
direct score ABC, and kernel ABC) to which we will 
compare our method in section [^ In section [^ we 
introduce our proposed algorithm. Experimental re¬ 
sults including comparisons with methods discussed 
in section are presented in section [^ We explore 
computational tractability of K2-ABC in section 

2 Background 

We start by introducing ABC and reviewing existing 
algorithms. Consider a situation where it is possible 
to simulate a generative model and thus sample from 
the conditional density p{y\9), given a value d £ 0 of 
parameters, while the computation of the likelihood 
p{y*\9) for the observed data y* is intractable. Nei¬ 
ther exact posterior inference nor posterior sampling 
are possible in this case, as the posterior p{9\y*) oc 
Tr{9)p(y*\9), for a prior tt, cannot be computed up to 
a normalizing constant. ABC uses an approximation 
of the likelihood obtained from simulation. 

The simplest form of ABC is rejection ABC. Let 
e > 0 be a similarity threshold, and p be a notion 
of distance, e.g., a premetric on domain y of obser¬ 
vations. The rejection ABC proceeds by sampling 
multiple model parameters 0 ~ tt. For each 9, a 
pseudo dataset y is generated from p{y\9). The pa¬ 
rameters 9 for which the generated y are similar to 
the observed y* , as decided by p{y,y*) < e, are ac¬ 
cepted. The result is an exact sample {9i}^!^-^ from 
the approximated posterior p^{9\y*) cx 7r(0)pe(?/*|d), 
where Pe{y*\9) = p(#)rfj/ and B^{y*) = 

{y : p{y, y*) < e}. Choice of p is crucial for the design 
of an accurate ABC algorithm. Applying a distance 
directly on dataset y is often challenging, when the 
dataset consists of a large number of (possibly multi¬ 
variate) observations. 
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Thus, one resorts to first choosing a summary statis¬ 
tic s(y) and comparing them between the datasets, i.e. 
Piy^y') = \\s{y) — s(2/0ll- Since it is generally difficult 
to construct sufficient statistics for complex models, 
this will often “leak” information, e.g., if s{y) repre¬ 
sents first few empirical moments of dataset y. It is 
only when the summary statistic s is sufficient, that 
this approximation is consistent as e —>■ 0 , i.e. that the 
ABC posterior p^{9\y*) will converge to the full poste¬ 
rior. Otherwise, ABC operates on the partial posterior 
p{9\s{y*)), rather than the full posterior p{9\y*). 

Another interpretation of the approximate likelihood 
p^{y*\9) is as the convolution of the true likeli¬ 
hood p{y\9) and the “similarity” kernel K^{y,y*) = 
l{yGB^{y*)). Being the indicator of the e-ball 
B^{y*) computed w.r.t. p, this kernel imposes a hard- 
constraint leading to the rejection sampling. In fact, 
one can use any similarity kernel parametrised by e 
which approaches delta function 6y* as e —>■ 0. A fre¬ 
quently used similarity kernel takes the form 

Ke{y,y') = exp (y^y) ^ ^ ( 2 ^) 


Hastings accept/reject steps with the acceptance prob¬ 
ability given by a{9'\9) = min 
where q{9\9') is a proposal distribution. 


1 ■7r(e')p(s*|e')i}(e|6>') 


Semi-Automatic ABC (SA-ABC) Under a 
quadratic loss, [ 8 ] shows that the optimal choice of the 
summary statistics is the true posterior means of the 
parameters E[0|j/] - however, these cannot be calcu¬ 
lated analytically. Thus, [5] proposed to use the simu¬ 
lated data in order to construct new summary statis¬ 
tics - estimates of the posterior means of the parame¬ 
ters - by fitting a vector-valued regression model from 
a set of candidate summary statistics to parameters. 
Namely, a linear model 0 = E [d|y]-|-e = Bg{y)+£ with 
the vector of candidate summary statistics g{y) used 
as explanatory variables (these can be simply g{y) = y 
or also include nonlinear functions of y) is fitted based 
on the simulated data {{yi,9i)}^i. Here, assuming 
0 G 0 C and g{y) G K”, B is a d x r matrix of 
coefficients. The resulting estimates s{y) = Bg{y) of 
E [0|j/] can then be used as summary statistics in stan¬ 
dard ABC by setting p{y,y') = ||s(j/) - s(y')ll 2 - 


Such construction would result in a weighted sam¬ 


ple where Wj = 

can be directly utilised in estimating posterior expec¬ 
tations. That is, for a test function /, the expec¬ 
tation Jq f{9)p{9\y*)d9 is estimated using E[f{9)] = 

This is an instance of a soft ABC, 
where parameter samples from the prior are weighted, 
rather than accepted or rejected. 




which 


Synthetic likelihood ABC (SL-ABC) Intro¬ 
duced in [20]) the synthetic likelihood ABC models 
simulated data in terms of their summary statistics 
and further assumes the summary statistics have mul¬ 
tivariate normal distribution, s ~ Eg), with the 

empirical mean and covariance defined by 

M M 

= - fieV, 

i=l i=l 

where Si denotes the vector of summary statistics of 
the ith simulated dataset. Using the following simi¬ 
larity kernel to measure the distance from the sum¬ 
mary statistics of actual observations s*, Ke(s*,s) = 

| 27 re /|~2 exp ) the resulting synthetic like¬ 

lihood is given by 

p{s*\9) = J Ke{s*,s)J\f{s\fig,Ee) ds 
= Af(s*\fig, Eg -f e^/). 

Relying on the synthetic likelihood, SL-ABC algo¬ 
rithm performs MCMC sampling based on Metropolis- 


Indirect score ABC (IS-ABC) Based on an aux¬ 
iliary model PA (?/!</') with a vector of parameters 
(j) = ) ?^dim( 0 )]^) the indirect score ABC uses 

a score vector Sa as the summary statistic m- 

SA{yA) = When 


the auxiliary model parameters </> are set by the 
maximum-likelihood estimate (MLE) fitted with ob¬ 
served data y*, the score for the observed data 
SA{y*Tfl^uLsiy*)) becomes zero. Based on this fact, 
IS-ABC searches for the parameter values whose corre¬ 
sponding simulated data produce a score close to zero. 
The discrepancy between observed and simulated data 
distributions in IS-ABC is given by 


p{s{y),s{y*)) 

=\lsA{y, <(’MLE {y*)VJ (</'MLE (y*) )5 'a (y, ((imle (y*)), 

where J(<(>mle(?/*)) is the approximate covariance ma¬ 
trix of the observed score. 


Kernel ABC (K-ABC) The use of a positive def¬ 
inite kernel in ABC has been explored recently in [21] 
(K-ABC) in the context of population genetics. In 
K-ABC, ABC is cast as a problem of estimating a 
conditional mean embedding operator mapping from 
summary statistics s{y) to corresponding parameters 
9. The problem is equivalent to learning a regression 
function in the RKHSs of s{y) and 9 induced by their 
respective kernels [22] • The training set T needed 
for learning the regression function is generated by 
firstly sampling {{yi,9i)}fii ~ p{y\9)Tr{9) from which 
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T := {(si, by summarising each pseudo dataset 

Ui into a summary statistic Si. 

In effect, given a summary statistic s* corresponding 
to the observations y* , the learned regression function 
allows one to represent the embedding of the poste¬ 
rior distribution in the form of a weighted sum of the 
canonical feature maps {k{-, where k is a kernel 

associated with an RKHS T-Le. In particular, if we as¬ 
sume that fc is a linear kernel (as in EH), the posterior 
expectation of a function / G He is given by 

M 

M 

+ MXI)-%k{sj,s*), 

i=i 

where Gij = g{si, sj), g is a kernel on s, and A is a reg¬ 
ularization parameter. The use of a kernel g on sum¬ 
mary statistics s implicitly transforms s non-linearly, 
thereby increasing the representativeness of s. Never¬ 
theless, the need for summary statistics is not elimi¬ 
nated. 

3 Proposed method 

We first overview kernel MMD, a notion of distance 
between probability measures that is used in the pro¬ 
posed K2-ABC algorithm. 

Kernel MMD For a probability distribution on 
a domain X, its kernel embedding is defined as = 
Ex^F^k{-,X) [13, an element of an RKHS H associ¬ 
ated with a positive definite kernel fc : A x A —>■ M. 
An embedding exists for any F^. whenever the kernel 
k is bounded, or if F^ satisfies a suitable moment con¬ 
dition w.r.t. an unbounded kernel k |23j . For any two 
given probability measures F^ and Fy, their maximum 
mean discrepancy (MMD) is simply the Hilbert space 
distance between their embeddings: 

MMD^(Fa;,Fy) = - g-Fy\\l^ 

=ExEx>k{X, X') + EyEykiY, Y') - 2ExEv/c(X, F), 

where X,X' F^ and Y,Y' Fy. While simple 
kernels like polynomial of order r capture differences 
in first r moments of distributions, particularly inter¬ 
esting are kernels with a characteristic property [16] . 
for which the kernel embedding is injective and thus 
MMD defined by such kernels gives a metric on the 
space of probability distributions. Examples of such 
kernels include widely used kernels such as Gaussian 

^A related notion of universality is often employed. 


RBF and Laplacian. Being written in terms of ex¬ 
pectations of kernel functions allows straightforward 
estimation of MMD on the basis of samples: given 

unbiased es¬ 
timator is given by 


MMD iF,,Fy) = 


TT'x(tT'X — 1 ) 

nyiny - 1) S ^ 




i=i j=i 


Further operations are possible on kernel embeddings 
- one can define a positive definite kernel on proba¬ 
bility measures themselves using their representation 
in a Hilbert space. An example of a kernel on proba- 
bihty measures is KXtx,ty) = exp (- )—1, 

[21] with e > 0. This has recently led to a thread 
of research tackling the problem of learning on distri¬ 
butions, e.g., [23 ■ These insights are essential to our 
contribution, as we employ such kernels on probabil¬ 
ity measures in the design of the K2-ABC algorithm 
which we describe next. 


K2-ABC The first component of K2-ABC is a non- 
parametric distance p between empirical data distri¬ 
butions. Given two datasets y = (?/^^\ ..., and 
y' = consisting of n i.i.d. observa¬ 

tion^ we use MMD to measure the distance between 
y^y'- p‘^{y,y') = MMD {Fy,Fy>), i.e. is an un¬ 
biased estimate of MMD^ between probability distri¬ 
butions Fy and Fyi used to generate y and y'. This 
is almost the same as setting empirical kernel embed¬ 
ding s{y) = pp = ^ ('j y^^^) fo be the summary 

statistic. However, in that case \\s{y) — s(y')\\'^ = 
MMD‘^{F^,Fy) would have been a biased estimate of 
the population MMD ' m- Our choice of p is guar¬ 
anteed to capture all possible differences (i.e. all mo¬ 
ments) between Fy and Fyi whenever a characteristic 
kernel k is employed [16] . i.e. we are operating on a 
full posterior and there is no loss of information due 
to the use of insufficient statistics. 

Further, we introduce a second kernel into the ABC 
algorithm (summarised in Algorithm [^, the one that 
operates directly on probability measures, and com- 


^The i.i.d. assumption can be relaxed in practice, as we 
demonstrate in section]^ on time series data. 
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Algorithm 1 K2-ABC Algorithm 

Input: observed data y*, prior tt, soft threshold e 

Output: Empirical posterior 

for i = 1,... ,M do 


Sample 6i ^ n 

Sample pseudo dataset j/i ^ 


Wi = exp 


MMD 




end for 

= Wt/ Y^f=i Wj for i = 1,..., M 


pute the ABC posterior sample weights, 


KeiFy,Fy>) = exp - 


MMD {Fy,Fy>) 


, ( 2 ) 


with a suitably chosen parameter e > 0. Now, the 
datasets are compared using the estimated similarity 
Ate between their generating distributions. There are 
two sets of parameters in K2-ABC, parameters of ker¬ 
nel k (on original domain) and e in the kernel Ate (on 
probability measures). 

K2-ABC is readily applicable to non-Euclidean input 
objects if a kernel k can be defined on them. Ar¬ 
guably the most common application of ABC is to 
genetic data. Over the past years there have been a 
number of works addressing the use of kernel methods 
for studying genetic data including [26] which consid¬ 
ered genetic association analysis, and m which stud¬ 
ied gene-gene interaction. Generic kernels for strings, 
graphs and other structured data have also been ex¬ 
plored [ 2 H] . 


4 Experiments 

Toy problem We start by illustrating how the 
choice of summary statistics can significantly affect the 
inference result, especially when the summary statis¬ 
tics are not sufficient. We consider a symmetric Dirich- 
let prior TT and a likelihood piylO) given by a mixture 
of uniform distributions as 

tt{9) = Dirichlet(0; 1), 

5 

P{y\^) = X! ^»Uniform(?/; [i - 1, i]). (3) 

i=l 

The model parameters 9 are a vector of mix¬ 
ing proportions. The goal is to estimate E[0|j/*] 
where y* is generated with true parameter 9* = 
[0.25,0.04, 0.33, 0.04, 0.34]T (see Fig. [^). The sum¬ 
mary statistics are chosen to be empirical mean and 
variance i.e. s{y) = (E[j/], V[?/])^. 


We compare three ABC algorithms: K2-ABC, rejec¬ 
tion ABC, and soft ABC. Here, soft ABC refers to an 
ABC algorithm which uses a similarity kernel in Eq. [l] 
withg = 2 andp(?/, 2 /') = ||s(y)-s(A/')ll 2 - ForK2-ABC, 
a Gaussian kernel defined as k{a, b) = exp 

is used where 7 is set to median({||y*(®) — y*^^'^\\}i,j) 
[29| . We test different values of e on a coarse grid, and 
report the estimated E[ 0 | 2 /*] which is closest to 9* as 
measured with a Euclidean distance. 

The results are shown in Fig. where the top row 
shows the estimated E[ 0 | 2 /*] from each method, asso¬ 
ciated with the best e as reported in the third row. The 
second row of Fig. from left to right, shows y* and 
400 realizations of y drawn from p(?/|E[0|y*]) obtained 
from the three algorithms. In all cases, the mean and 
variance of the drawn realizations match that of y*. 
However, since the first two moments are insufficient 
to characterise p{y\9*)^ there exists other 9' that can 
give rise to the same s{y'), which yields inaccurate 
posterior means shown in the top row. In contrast, 
K2-ABC taking into account infinite-dimensional suf¬ 
ficient statistic correctly estimates the posterior mean. 

Ecological dynamic systems As an example of 
statistical inference for ecological dynamic systems, 
we use observations on adult blowfly populations over 
time introduced in |5D|. The population dynamics are 
modelled by a discretised differential equation: 

Nt+i = PNt-r exp e* -f Nt exp{-5et), 

where an observation at time t -I-1 is denoted by Nt+i 
which is determined by time-lagged observations Nt 
and Nt-T as well as Gamma distributed noise realisa¬ 
tions e* ~ Gam(^,(Tp) and et ~ Gam(^,CT^). Here, 

P d 

the parameters are 9 = {P,No,ad,(Tp,T,S}. We put 
broad Gaussian priors on log of parameters as shown 
in Fig. §4. Note that the time series data given the pa¬ 
rameters drawn from the priors vary drastically (see 
Fig. [^), and therefore inference with those data is 
very challenging as noted in |30] . 

The observation (black trace in Fig. §3) is a time- 
series of length T = 180, where each point in time in¬ 
dicates how many flies survive at each time under food 
limitation. For SL-ABC and K-ABC, we adopted the 
custom 10 summary statistics used in [30] : the log of 
the mean of all 25% quantiles of {At/1000}^i (four 
statistics), the mean of 25% quantiles of the first-order 
differences of {iVt/ 1000 }^;^ (four statistics), and the 
maximal peaks of smoothed {Nt}J^i, with two differ¬ 
ent thresholds (two statistics). For IS-ABC, following 
m, we use a Gaussian mixture model with three com¬ 
ponents as an auxiliary model. In addition, we ran 
two versions of SA-ABG algorithm on this example: 
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B. K2-ABC Rejection-ABC Soft-ABC 
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Figure 1: A possible scenario for ABC algorithms to fail due to insufficient summary statistics. A: Using 
the 5-dimensional true parameters (top), 400 observations (bottom) are sampled from the mixture of uniform 
distributions in Eq. B (top): Estimated posterior mean of parameters from each method. We drew 1000 
parameters from the Dirichlet prior and 400 simulated data points given each parameter. In rejection and 
soft ABC algorithms, we used empirical mean and variance of observations as summary statistics to determine 
similarity between simulated and observed data. B (middle): Histograms of 400 simulated data points given 
estimated posterior means by each method. Though the mean and variance of simulated data from rejection and 
soft ABC match that of the observed data, the shapes of the empirical distributions notably differ. B (bottom): 
Euclidean distance between true and estimated posterior mean of parameters as a function of e. We varied the 
e values to find the optimal range in terms of the Euclidean distance. The magnitude of e is algorithm-specific 
and not comparable across methods. 


SAQ regresses to simulated parameters from the corre¬ 
sponding simulated instances of time-series appended 
with the quadratic terms , i.e., g{y) = {y,y‘^) S 
whereas SA-custom uses the above custom 10 sum¬ 
mary statistics from [3D] appended with their squares 
as the candidate summary-statistics vector g{y) (which 
is thus 20-dimensional in this instance)]^ 

For setting e and kernel parameters in K2-ABC, we 
split the data into two sets: training (75% of 180 data 
points) and test (the rest) sets. Using the training 
data, we ran each ABC algorithm given each value of 
e and kernel parameters defined on a coarse grid, then, 
computed test erroij^to choose the optimal values of e 
and kernel parameters in terms of the minimum pre¬ 
diction error. Finally, with the chosen e and kernel 

"^For SL-ABC, we used the Python implementation by 
the author of [3^. For IS-ABC, we used the MATLAB 
implementation by the author of [10]. For SA-ABC, we 
used the R package abctools written by the author of [8]. 

®We used Euclidean distance between the histogram 
(with 10 bins) of test data and that of predictions made by 
each method. We chose the difference in histogram rather 
than in the realisation of y itself, to avoid the error due to 
the time shift in y. 


parameters, we ran each ABC algorithm using the en¬ 
tire data. 

We show the concentrated posterior mass after per¬ 
forming K2-ABC in Fig. [2^, as well as an example 
trajectory drawn from the inferred posterior mean in 
Fig. [ 2 J 3 1^ To quantify the accuracy of each method, 
we compute the Euclidean distance between the vec¬ 
tor of chosen 10 summary statistics s* = s{y*) for the 
observed data and s[y) for the simulated data y given 
the estimated posterior mean 0 of the parameters. As 
shown in Fig. [^ K2-ABC outperforms other meth¬ 
ods, although SL-ABC, SA-custom, and K-ABC all 
explicitly operate on this vector of summary statistics 
s while K2-ABC does not. In other words, while those 
methods attempt to explicitly pin down the parts of 
the parameter space which produce summary statistics 
s similar to s*, insufficiency of these summary statis¬ 
tics affects the posterior mean estimates undesirably 
even with respect to that very metric. 


®Note that reproducing the trajectory exactly is not the 
main goal of this experiment. We show the example tra¬ 
jectory to give the readers a sense of what the trajectory 
looks like. 
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Figure 2: Blowfly data. A (top): Histograms of 10,000 samples for four parameters {logP, logiVo,log(Tp,logr} 
drawn from the prior. A (middle/bottom): Histogram of samples from the posteriors obtained by K2-ABC 
/ SL-ABC (acceptance rate: 0.2, burn-in: 5000 iterations), respectively. In both cases, the posteriors over 
parameters are concentrated around their means (black bar). The posterior means of P and r obtained from 
K2-ABC are close to those obtained from SL-ABC, while there is noticeable difference in the means of Nq and 
CTp. Note that we were not able to show the same histogram for K-ABC since the posterior obtained by K-ABC 
is improper. B (top): Three realisations of y given three different parameters drawn from the prior. Small 
changes in 0 drastically change y. B (middle to bottom): Simulated data using inferred parameters (posterior 
means) shown in A. Our method (in red) produces the most similar dynamic trajectory to the actual observation 
(in black) among all the methods. 
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Figure 3: Euclidean distance between the chosen 10 
summary statistics of y* and y given the posterior 
mean of parameters estimated by various methods. 
Due to the fluctuations in y from the noise realisa¬ 
tions (et, et), we made 100 independent draws of y and 
computed the Euclidean distance. Note that K2-ABC 
achieved nearly 50% lower errors than the next best 
method, SL-ABC, although SL-ABC, SA-custom, and 
K-ABC all explicitly operate on the summary statis¬ 
tics in the comparison while K2-ABC does not. 


5 Computational tractability 

In K2-ABC, given a dataset and pseudo dataset 
with n observations each, the cost for computing 

MMD {Fy.,Fyt) is O(n^). For M pseudo datasets, 
the total cost then becomes 0{Mn^), which can be 
prohibitive for a large number of observations. Since 
computational tractability is among the core consider¬ 
ations for ABC, in this section, we examine the per¬ 
formance of K2-ABC with different MMD approxima¬ 
tions which reduce the computational cost. 


Linear-time MMD The unbiased linear-time 
MMD estimator presented in section 6] reduces 
the total cost to 0{Mn) at the price of a higher vari¬ 
ance. Due to its computational advantage, the linear¬ 
time MMD has been successfully applied in large-scale 
two-sample testing [ST] as a test statistic. The original 
linear-time MMD is given by 


MMD, {F,,Fy) 
9 r 

n 

1—1 ^ 
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Note that we have assumed the same number of ob¬ 
servations Ux = riy = n from Fx and Fy. The esti¬ 
mator MMD; is constructed so that the independence 
of the summands allows derivation of its asymptotic 
distribution and the corresponding quantile computa¬ 
tion needed for two-sample testing. However, since 
we do not require such independence, we employ a 
linear-time estimator with a larger number of sum¬ 
mands, which also does not require Ux = ny With¬ 
out loss of generality, we assume Ux < Uy Denote 

X^^'> := j ^ j .^g 

cyclic shift through the smaller dataset The 

linear-time MMD estimator that we propose is 


__2 ^ 1 

MMDL{Fx,Fy) = -- V 

FLy 1 


which is an unbiased estimator of MMD^(F 2 ,,F^). 
The total cost of the resulting K2-ABC algorithm is 

0{M{nx + riy)). 


MMD with random Fourier features Another 
fast linear MMD estimator can be achieved by consid¬ 
ering an approximation to the kernel function k{x, y) 
with an inner product of finite dimensional feature vec¬ 
tors (j){x)^^{y) where (j){x) e and D is the number 
of features. Given the feature map (/)(•) such that, 
k{x,y) « (j){x)^(j){y), MMD^ can be approximated as 

MMDlf{Fx,Fy) 

xEx^{X)^Ex4iX')+EY$iYyEY4{Y') 

- 2Ex<^(X)^Ey0(y) 

:= \\ ExHX ) -^ yHY )\\1 


A straightforward (biased) estimator is 


M^%{Fx,Fy) 


1 


2 = 1 


n 


which can be computed in 0{D{nx + ny)), i.e., lin¬ 
ear in the sample size, leading to the overall cost of 

0{MD{nx + Uy)). 

Given a kernel fc, there are a number of ways to ob¬ 
tain <()(•) such that k{x, y) « (j){x)^(j){y). One approach 
which became popular in recent years is based on ran¬ 
dom Fourier features [32] which can be applied to any 
translation invariant kernel. Assume that k is transla¬ 
tion invariant i.e., k{x,y) = k(x — y) for some function 
k. According to Bochner’s theorem [33], k can be writ¬ 
ten as 

k{x-y) = j dA(a;) 

= cos(a;^(x - y)) 

= 2 Eb.^c/[o, 27 r]E<.j^A cos(a;^x + b) cos{uj^y + b), 


where i = and due to positive-definiteness of 

k, its Fourier transform A is nonnegative and can be 
treated as a probability measure. By drawing ran¬ 
dom frequencies {uJi}!jL-^ ~ A and ~ t7[0, 27r], 

k{x — y) can be approximated with a Monte Carlo av¬ 
erage. It follows that 4>j{x) = 2/D cos{ujJX + bj) 

and (j){x) = {^i{x),... ,(j)D{x)y^. Note that a Gaus¬ 
sian kernel k corresponds to normal distribution A. 

Empirical results We employ the linear-time and 
the random Fourier feature MMD estimators in our 
K2-ABC algorithm, which we call K2-lin and K2-rf, re¬ 
spectively, and test these variants on the blowfly data. 
For K2-rf, we used 50 random features. 


3 



K2 K2-rf K2-lin 


: Figured: K2-ABC 

— with different 

MMD estima- 
== tors outperform 

-h the best existing 

method, SL-ABC, 
on the blowfly 
SL data. 


6 Conclusions and further work 

We investigated the feasibility of using MMD as a dis¬ 
crepancy measure of samples from two distributions 
in the context of ABC. Via embeddings of empirical 
data distributions into an RKHS, we effectively take 
into account infinitely many implicit features of these 
distributions as summary statistics. When tested on 
both simulated and real-world datasets, our approach 
obtained more accurate posteriors, compared to other 
methods that rely on hand-crafted summary statistics. 

While any choice of a characteristic kernel will guar¬ 
antee infinitely many features and no information loss 
due to the use of partial posteriors, we note that the 
kernel choice is nonetheless important for MMD esti¬ 
mation and therefore also for the efficiency of the pro¬ 
posed K2-ABC algorithm. As widely studied in the 
RKHS literature, the choice should be made to best 
capture characteristics of given data, i.e., by utilising 
domain-specific knowledge. For instance, when some 
data components are believed a priori to be on differ¬ 
ent scales, one can adopt the automatic relevance de- 
temination (ARD) kernel instead of the Gaussian ker¬ 
nel. Formulating explicit efficiency criteria in the con¬ 
text of ABC and optimizing over kernel choice, simi¬ 
larly as in the context of two-sample testing |31| , would 
be an essential extension. 
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